Node Identification for Distributed Shared Memory System

ABSTRACT

An example embodiment of the present invention provides processes relating to a connection/communication protocol and a memory-addressing scheme for a distributed shared memory system. In the example embodiment, a logical node identifier comprises bits in the physical memory addresses used by the distributed shared memory system. Processes in the embodiment include logical node identifiers in packets which conform to the protocol and which are stored in a connection control block in local memory. By matching the logical node identifiers in a packet against the logical node identifiers in the connection control block, the processes ensure reliable delivery of packet data. Further, in the example embodiment, the logical node identifiers are used to create a virtual server consisting of multiple nodes in the distributed shared memory system.

CROSS-REFERENCE TO RELATED APPLICATIONS

The present application is a continuation application of pending U.S. patent application Ser. No. 11/740,432 filed Apr. 26, 2007 and entitled “Node Identification for Distributed Shared Memory System,” which is related to the following commonly-owned U.S. utility patent application, filed on Jan. 29, 2007, whose disclosure is incorporated herein by reference in its entirety for all purposes: U.S. patent application Ser. No. 11/668,275, entitled “Fast Invalidation for Cache Coherency in Distributed Shared Memory System”.

TECHNICAL FIELD

The present disclosure relates to an identification process for the nodes in a distributed shared memory system.

BACKGROUND

A distributed shared memory (DSM) is a multiprocessor system in which the processors in the system are connected by a scalable interconnect, such as an InfiniBand switched fabric communications link, instead of a bus. DSM systems present a single memory image to the user, but the memory is physically distributed at the hardware level. Typically, each processor has access to a large shared global memory in addition to a limited local memory, which might be used as a component of the large shared global memory and also as a cache for the large shared global memory. Naturally, each processor will access the limited local memory associated with the processor much faster than the large shared global memory associated with other processors. This discrepancy in access time is called non-uniform memory access (NUMA).

A major technical challenge in DSM systems is ensuring that each processor's memory cache is consistent with each other processor's memory cache. Such consistency is called cache coherence. To maintain cache coherence in larger distributed systems, additional hardware logic (e.g., a chipset) or software is used to implement a coherence protocol, typically directory-based, chosen in accordance with a data consistency model, such as strict consistency. DSM systems that maintain cache coherence are called cache-coherent NUMA (ccNUMA).

Typically, if additional hardware logic is used, a node in the system will comprise a chip that includes the hardware logic and one or more processors and will be connected to the other nodes by the scalable interconnect. For purposes of initial connection and later communication between nodes, the system might employ node identifiers, e.g., serial, random, or centrally assigned numbers, which in turn might be used as part of an address for physical memory residing on the node.

SUMMARY

In particular embodiments, the present invention provides methods, apparatuses, and systems directed to node identification in a DSM system. In one particular embodiment, the present invention provides node-identification processes for use with a connection/communication protocol and a memory-addressing scheme in a DSM system.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram showing a DSM system, which system might be used with some embodiments of the present invention.

FIG. 2 is a block diagram showing some of the physical and functional components of an example DSM-management chip or logic circuit, which chip might be used as part of a node with some embodiments of the present invention.

FIG. 3 is a diagram showing the format of an RDP over Ethernet packet and its header, which formats might be used in some embodiments of the present invention.

FIG. 4 is a diagram showing the format of an RDP over InfiniBand packet and its header, which formats might be used in some embodiments of the present invention.

FIG. 5 is a diagram showing the format of an RDP packet, its header, and its optional trailer, which formats might be used in some embodiments of the present invention.

FIG. 6 is a diagram showing the format of a connection control block, which format might be used in some embodiments of the present invention.

FIG. 7 is a diagram showing an example illustrating the use of LNIDs with respect to the RDP protocol, which protocol might be used with an embodiment of the present invention.

FIG. 8 is a diagram showing a flowchart of an example process for building an RDP packet for transmission over the switched fabric network, which process might be used with an embodiment of the present invention.

FIG. 9 is a diagram showing a flowchart of an example process for validating an RDP packet received over the switched fabric network, which process might be used with an embodiment of the present invention.

FIG. 10 is a diagram showing the format of a 40-bit physical memory address in a 16-node DSM system and the format of a 40-bit physical memory address in a 256-node DSM system, which formats might be used with embodiments of the present invention.

FIG. 11 is a diagram showing, for didactic purposes, the local views of a physical address space for a virtual server comprised of three nodes.

FIG. 12 is a diagram showing a flowchart of an example process for altering a physical memory address prior to transmission over a HyperTransport bus, which process might be used with an embodiment of the present invention.

FIG. 13 is a diagram showing a flowchart of an example process for altering a physical memory address prior to transmission over a switched fabric, which process might be used with an embodiment of the present invention.

DETAILED DESCRIPTION OF THE INVENTION

The following example embodiments are described and illustrated in conjunction with apparatuses, methods, and systems which are meant to be examples and illustrative, not limiting in scope.

A. ccNUMA DSM System with DSM-Management Chips

A DSM system has been developed that provides cache-coherent non-uniform memory access (ccNUMA) through the use of a DSM-management chip. In a particular embodiment, a DSM system may comprise a distributed computer network of up to 16 nodes, connected by a switched fabric, where each node includes two or more Opteron CPUs and one DSM-management chip. In another embodiment, this DSM system comprises up to 256 nodes connected by the switched fabric.

The DSM system allows the creation of a multi-node virtual server, which is a virtual machine consisting of multiple CPUs belonging to two or more nodes. In some embodiments, the nodes use a connection/communication protocol to communicate with each other and with virtual I/O servers in the DSM system. Enforcement of the connection/communication protocol is also handled by the DSM-management chip. Consequently, virtual I/O servers include a DSM-management chip, though they do not contribute any physical memory to the DSM system and consequently do not make use of the chip's functionality directly related to cache coherence, in particular embodiments. For a further description of a virtual I/O server, see U.S. patent application Ser. No. 11/624,542, entitled “Virtualized Access to I/O Subsystems”, and U.S. patent application Ser. No. 11/624,573, entitled “Virtual Input/Output Server”, both filed on Jan. 18, 2007, which are incorporated herein by reference for all purposes. As explained below, the connection/communication protocol uses an identifier called a logical node identifier (LNID) to identify source and destination nodes for packets that travel over the switched fabric.

FIG. 1 is a diagram showing a ccNUMA DSM system, which system might be used with a particular embodiment of the invention. In this DSM system, four nodes (labeled 101, 102, 103, and 104) are connected to each other over a switched fabric (labeled 105) such as Ethernet or InfiniBand. In turn, each of the four nodes includes two Opteron CPUs, a DSM-management chip, and memory in the form of DDR2 SDRAM (double-data-rate two synchronous dynamic random access memory). In this embodiment, each Opteron CPU includes a local main memory connected to the CPU. This DSM system provides NUMA (non-uniform memory access) since each CPU can access its own local main memory faster than it can access the other memories shown in FIG. 1.

Also as shown in FIG. 1, a block of memory has its “home” in the local main memory of one of the Opteron CPUs in node 101. That is to say, this local main memory is where the system's version of the memory block is stored, regardless of whether there are any cached copies of the block. Such cached copies are shown in the DDR2s for nodes 103 and 104. The DSM-management chip includes hardware logic (e.g., the CMM) to enforce a coherence protocol and make the DSM system cache-coherent (e.g., ccNUMA) when multiple nodes are caching copies of the same block of memory.

B. Example System Architecture of a DSM-Management Chip

FIG. 2 is a diagram showing the physical and functional components of a DSM-management chip, which chip might be used as part of a node with particular embodiments of the invention. The DSM-management chip includes interconnect functionality facilitating communications with one or more processors, which might be Opteron processors offered by Advanced Micro Devices (AMD), Inc., of Sunnyvale, Calif., in some embodiments. As FIG. 2 illustrates, the DSM-management chip includes two HyperTransport Managers (HTM), each of which manages communications to and from a processor over an HT (HyperTransport) bus. More specifically, an HTM provides the PHY and link layer functionality for a cache-coherent HT interface such as Opteron's ccHT. The HTM captures all received HT packets in a set of receive queues per interface (e.g., posted/non-posted command, request command, probe command and data) which are consumed by the Coherent Memory Manager (CMM). The HTM also captures packets from the CMM in a similar set of transmit queues per interface and transmits those packets on the HT interface. As a result of the two HTMs, the DSM-management chip becomes a coherent agent with respect to any bus snoops broadcast over the cache-coherent HT bus by a processor's memory controller. Of course, other inter-chip or bus communications protocols might be used in other embodiments of the present invention.

Also as shown in FIG. 2, the two HTMs are connected to a Coherent Memory Manager (CMM), which enforces a coherence protocol and thereby provides cache-coherent access to memory shared by the nodes that are part of the DSM fabric. In addition to interfacing with the Opteron processors through the HTM, the CMM interfaces with the fabric via the RDM (Reliable Delivery Manager). Additionally, the CMM provides interfaces to the HTM for DMA (Direct Memory Access) and configuration.

In some embodiments, the CMM behaves like both a processor cache on a cache-coherent (e.g., ccHT) bus and a memory controller on a cache-coherent (e.g., ccHT) bus, depending on the scenario. In particular, when a processor on a node performs an access to a home (or local) memory address, the home (or local) memory will generate a probe request that is used to snoop the caches of all the processors on the node. The CMM will use this probe to determine if it has exported the block of memory containing that address to another node and may generate DSM probes (over the fabric) to respond appropriately to the initial probe. In this scenario, the CMM behaves like a processor cache on the cache-coherent bus.

When a processor on a node performs an access to a remote memory, the processor will direct this access to the CMM. The CMM will examine the request and satisfy it from the local cache, if possible, and, in the process, generate any appropriate probes. If the request cannot be satisfied from the local cache, the CMM will send a DSM request to the remote memory's home node to (a) fetch the block of memory that contains the requested data or (b) request a state upgrade. In this case, the CMM will wait for the DSM response before it responds back to the processor. In this scenario, the CMM behaves like a memory controller on the ccHT bus.

The RDM manages the flow of packets across the DSM-management chip's two fabric interface ports. The RDM has two major clients, the CMM and the DMA Manager (DMM), which initiate packets to be transmitted and consume received packets. The RDM ensures reliable end-to-end delivery of packets using a connection/communication protocol called Reliable Delivery Protocol (RDP). On the fabric side, the RDM interfaces to the selected link/MAC (XGM for Ethernet, IBL for InfiniBand) for each of the two fabric ports. In particular embodiments, the fabric might connect nodes to other nodes. In other embodiments, the fabric might also connect nodes to virtual I/O servers. In particular embodiments, the processes using LNIDs described below might be executed by the RDM.

The XGM provides a 10 G Ethernet MAC function, which includes framing, inter-frame gap handling, padding for minimum frame size, Ethernet FCS (CRC) generation and checking, and flow control using PAUSE frames. The XGM supports two link speeds: single data rate XAUI (10 Gbps) and double data rate XAUI (20 Gbps). In particular embodiments, the DSM-management chip has two instances of the XGM, one for each fabric port. Each XGM instance interfaces to the RDM, on one side, and to the associated PCS, on the other side.

The IBL provides a standard 4-lane IB link layer function, which includes link initialization, link state machine, CRC generation and checking, and flow control. The IBL block supports two link speeds, single data rate (8 Gbps) and double data rate (16 Gbps), with automatic speed negotiation. In particular embodiments, the DSM-management chip has two instances of the IBL, one for each fabric port. Each IBL instance interfaces to the RDM, on one side, and to the associated Physical Coding Sub-layer (PCS), on the other side.

The PCS, along with an associated quad-serdes, provides physical layer functionality for a 4-lane InfiniBand SDR/DDR interface, or a 10 G/20 G Ethernet XAUI/10GBase-CX4 interface. In particular embodiments, the DSM-management chip has two instances of the PCS, one for each fabric port. Each PCS instance interfaces to the associated IBL and XGM.

The DMM shown in FIG. 2 manages and executes direct memory access (DMA) operations over RDP, interfacing to the CMM block on the host side and the RDM block on the fabric side. For DMA, the DMM interfaces to software through the DmaCB table in memory and the on-chip DMA execution and completion queues. The DMM also handles the sending and receiving of RDP interrupt messages and non-RDP packets, and manages the associated inbound and outbound queues.

The DDR2 SDRAM Controller (SDC) attaches to one or two external 240-pin DDR2 SDRAM DIMMs, which are external to the DSM-management chip, as shown in both FIG. 1 and FIG. 2. In particular embodiments, the SDC provides SDRAM access for the CMM and the DMM.

In some embodiments, the DSM-management chip might comprise an application specific integrated circuit (ASIC), whereas in other embodiments the chip might comprise a field-programmable gate array (FPGA). Indeed, the logic encoded in the chip could be implemented in software for DSM systems whose requirements might allow for longer latencies with respect to cache coherence, DMA, interrupts, etc.

C. RDP Packets and Their Headers

FIG. 3 is a diagram showing the format of a packet for RDP over Ethernet and the packet's header, which formats might be used in some embodiments of the present invention. When RDP runs over the Ethernet MAC layer, an RDP packet is encapsulated in an Ethernet MAC frame. The Ethernet header of an encapsulated RDP packet is a VLAN-tagged header (where VLAN stands for virtual local area network). In FIG. 3, SA identifies the 6-byte source MAC address and DA identifies the 6-byte destination MAC address.

The Reliable Delivery Protocol allows RDP and non-RDP packets to co-exist on the same fabric. When RDP runs over the Ethernet MAC layer, RDP and non-RDP packets are distinguished from each other by the presence of the VLAN header and the value of the Length/Type field following it. For an RDP packet: (a) the VLAN header is present, i.e., the first Length/Type field (following the last SA byte) has a value of 0x0081; and (b) the second Length/Type field (following the VLAN header) has a value less than 1536 (frame length). An Ethernet frame that does not satisfy both of the above conditions is a non-RDP packet.

FIG. 4 is a diagram showing the format of a packet for RDP over InfiniBand and the packet's header, which formats might be used in some embodiments of the present invention. It will be appreciated that the header includes fields for Source Local ID and Destination Local ID. When RDP runs over the IB link layer, an RDP packet is encapsulated into an IB packet. The format of an IB Local Transport Packet is used, although the 12-byte Base Transport Header (BTH) which is normally present after the Local Route Header (LRH) is replaced by the RDP header (8 bytes) and the first 4 bytes of the RDP payload. From the standpoint of the IB standard, bits 31:24 of the first DWORD of the RDP Header are the OpCode field of the Base Transport Header (BTH). The most significant two bits (31:30) of that field have a fixed value of 0x3 (binary 11) for RDP packets, which specifies a ‘Manufacturer Specific OpCode’. The Rsv8 field of the BTH (bits 31:24 of the second DWORD) is not protected by the 32-bit IB Invariant CRC (ICRC). This corresponds to the most significant 8 bits of the DstLNID. Thus, these bits do not have end-to-end protection but do have point-to-point protection by the 16-bit Variant CRC (VCRC), which presents an insignificant risk of failure since the DstLNID is only used as a packet validation field at the destination node in conjunction with many other validation fields. A false match of a corrupted LNID MSB (most significant bit) with good VCRC has very low probability and would only occur if the connection parameters were set up inconsistently at the source and destination nodes.

When RDP runs over the InfiniBand link layer, RDP and non-RDP packets are distinguished by the values of the LNH field in the IB Local Route Header and the OpCode field in the IB Base Transport Header. For an RDP packet: (a) LNH=0x2 (IBA Local); and (b) OpCode bits [7:6]=0x3 (Manufacturer Specific OpCode). An InfiniBand packet that does not satisfy both of the above conditions is a non-RDP packet.
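The two InfiniBand conditions above can be sketched as a small predicate. This is a minimal illustration in Python, not the logic of the DSM-management chip; the function name is hypothetical, and the sketch assumes the LNH and OpCode fields have already been extracted from the LRH and BTH as integers.

```python
IBA_LOCAL = 0x2               # LNH value named in the text
MANUFACTURER_SPECIFIC = 0x3   # required value of OpCode bits [7:6]


def is_rdp_over_infiniband(lnh: int, opcode: int) -> bool:
    """Return True only when both RDP conditions from the text hold.

    lnh    -- 2-bit LNH field from the Local Route Header (assumed pre-parsed)
    opcode -- 8-bit OpCode field from the Base Transport Header position
    """
    top_two_opcode_bits = (opcode >> 6) & 0x3
    return lnh == IBA_LOCAL and top_two_opcode_bits == MANUFACTURER_SPECIFIC
```

A packet failing either test would be handled as a non-RDP packet.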

FIG. 5 is a diagram showing the format of an RDP packet and its header, which formats might be used in some embodiments of the present invention. An RDP packet consists of a header, payload, and optional trailer. As shown in FIG. 5, one of the fields in an RDP packet is the DestLNID (Destination Logical Node ID), which identifies the packet's destination node. This is the connection identifier (i.e., remote LNID) at the source node. This field is 16 bits wide. Also as shown in FIG. 5, another field in the RDP packet is the SrcLNID (Source Logical Node ID), which identifies the packet's source node. This is the connection identifier (i.e., remote LNID) at the destination node. This field is also 16 bits wide.

D. Using LNIDs with RDP

In particular embodiments, the DSM system uses a software data structure called the connection control block (CCB), stored in local memory such as the local main memory shown in FIG. 1, to facilitate implementation of the RDP protocol. The RDM uses a received packet's source LNID as an index into the CCB to find an entry for the connection corresponding to the packet. FIG. 6 is a diagram showing the format of a CCB entry for a single connection, which format might be used in some embodiments of the present invention. As shown in FIG. 6, each entry records the fabric address for two paths, Path 0 and Path 1, which may correspond to the two fabric interface ports shown connected to the RDM in FIG. 2. In other embodiments, there might be more than two paths, corresponding to more than two fabric interface ports. It will be appreciated that the CCB entry has a field called MY_LNID, which identifies the LNID for the RDM's node.

For an RDP connection between a pair of nodes, the node at each end uses an LNID to refer to the node at the other end. Within a multi-node virtual server (VS), every node is assigned a unique LNID, possibly by some management entity for the DSM system. For example, within a three-node VS, the LNID values might be 0, 1, and 2, or 1, 3, and 4, i.e., they need not increment sequentially from 0. In addition, every server (multi-node virtual server or standalone server) assigns a unique LNID to each node that communicates with it. For example, a standalone server node that communicates with the virtual server described above might be assigned an LNID value of 16 by the VS. If that same node communicates with another server, it may be assigned the same LNID or a different LNID by that server. Therefore, LNID assignments are unique from the standpoint of a given server, but they are not unique across servers.

An example of LNID assignments is shown in FIG. 7. In the example, a virtual computing environment (VCE) consists of two virtual servers (A and B), an application server (C), and a virtual I/O server (D). In this example, virtual server A assigns LNID values 0, 1, and 2 to each of its own nodes (VS nodes A0, A1, and A2, respectively) and an LNID value of 16 to virtual I/O server D. Virtual server B assigns values of 1 and 5 to each of its own nodes (VS nodes B1 and B5, respectively) and an LNID value of 18 to virtual I/O server D. Application server C assigns an LNID value of 3 to virtual I/O server D. Virtual I/O server D assigns LNID values 0, 2, and 4 to VS nodes A0, A1, and A2, respectively, and LNID values of 6 and 8 to VS nodes B1 and B5. Finally, virtual I/O server D assigns a value of 10 to application server C. These various assignments are collected and summarized in Table 7.1 in FIG. 7.

Table 7.2 shows the SrcLNID and DstLNID values used in the headers of RDP packets exchanged between different node pairs. For example, VS nodes A0 and A1 both belong to virtual server A, so a packet from A0 to A1 will have a SrcLNID value of 0 (LNID assigned to A0 by VS A), and a DstLNID value of 1 (LNID assigned to A1 by VS A). As another example, a packet from A1 to I/O server D will have a SrcLNID value of 2 (LNID assigned to A1 by I/O server D) and a DstLNID value of 16 (LNID assigned by VS A to I/O server D).
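The rule behind these header values can be sketched in a few lines of Python. The assignments below are the ones described for FIG. 7; the dictionary layout and the names `ASSIGNED`, `SERVER_OF`, and `header_lnids` are hypothetical and chosen only for illustration. The sketch assumes that the SrcLNID is the LNID the destination's server assigned to the source, and the DstLNID is the LNID the source's server assigned to the destination, which is consistent with both examples above.

```python
# LNID assigned BY each server (outer key) TO each node (inner key).
ASSIGNED = {
    "A": {"A0": 0, "A1": 1, "A2": 2, "D": 16},
    "B": {"B1": 1, "B5": 5, "D": 18},
    "C": {"D": 3},
    "D": {"A0": 0, "A1": 2, "A2": 4, "B1": 6, "B5": 8, "C": 10},
}

# Server that each node belongs to.
SERVER_OF = {
    "A0": "A", "A1": "A", "A2": "A",
    "B1": "B", "B5": "B",
    "C": "C", "D": "D",
}


def header_lnids(src: str, dst: str) -> tuple:
    """Return (SrcLNID, DstLNID) for an RDP packet sent from src to dst."""
    src_lnid = ASSIGNED[SERVER_OF[dst]][src]  # how the receiver knows the sender
    dst_lnid = ASSIGNED[SERVER_OF[src]][dst]  # how the sender knows the receiver
    return src_lnid, dst_lnid
```

For instance, `header_lnids("A1", "D")` yields `(2, 16)`, matching the second example in the text, and the reverse direction `header_lnids("D", "A1")` yields `(16, 2)`.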

FIG. 8 is a diagram showing a flowchart of an example process for building an RDP packet for transmission over the switched fabric network, which process might be used with an embodiment of the present invention. In the process's first step 801, the node's Reliable Delivery Manager (RDM) receives a DestLNID and data for an RDP packet from the node's CMM or DMM. The RDM uses the packet's DestLNID to look up the entry corresponding to the DestLNID in the Connection Control Block (CCB), in step 802. If there is no corresponding entry, the RDM sends an error message to the CMM or DMM, as the case may be. Then in step 803, the RDM builds an RDP header for an RDP packet for the data, using the DestLNID and the CCB entry's MY_LNID value. In step 804, the RDM builds a fabric header for the RDP packet, using information in the CCB entry's remote fabric address. Once the RDP packet is complete, the RDM sends the packet to the fabric link for transmission to the remote node, in step 805.
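The steps of FIG. 8 can be sketched in software as follows. This is a simplified Python illustration rather than the RDM's hardware logic: the `CcbEntry` class, the dictionary-based CCB and packet, and the `NoConnectionError` exception (standing in for the error message sent to the CMM or DMM) are all assumptions made for the example, and only a single fabric path is modeled.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class CcbEntry:
    """Hypothetical, simplified CCB entry (one path only)."""
    my_lnid: int
    remote_fabric_address: bytes


class NoConnectionError(Exception):
    """Stands in for the error message the RDM sends to the CMM or DMM."""


def build_rdp_packet(ccb: Dict[int, CcbEntry], dest_lnid: int, data: bytes) -> dict:
    # Step 802: look up the CCB entry for the destination LNID.
    entry = ccb.get(dest_lnid)
    if entry is None:
        raise NoConnectionError(f"no CCB entry for LNID {dest_lnid}")
    # Step 803: build the RDP header from DestLNID and the entry's MY_LNID.
    rdp_header = {"src_lnid": entry.my_lnid, "dst_lnid": dest_lnid}
    # Step 804: build the fabric header from the entry's remote fabric address.
    fabric_header = {"dst_address": entry.remote_fabric_address}
    # Step 805 (handing the packet to the fabric link) is left to the caller.
    return {"fabric": fabric_header, "rdp": rdp_header, "payload": data}
```

As a usage example, a node known to I/O server D as LNID 2, holding a CCB entry for D under LNID 16, would call `build_rdp_packet(ccb, 16, data)` and obtain a header with SrcLNID 2 and DstLNID 16.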

FIG. 9 is a diagram showing a flowchart of an example process for validating an RDP packet received over the switched fabric network, which process might be used with an embodiment of the present invention. In the process's first step 901, a node's RDM receives an RDP packet over the switched fabric network. The RDM then checks to see whether the packet's destination fabric address (e.g., the 6-byte MAC DA in an Ethernet header or the Destination Local ID in an InfiniBand LRH) matches the node's fabric address, in step 902. If not, the RDM discards the packet. Otherwise, the RDM goes to step 903 and determines whether the packet is an RDP packet. If not, the RDM will process the packet as a non-RDP packet, in step 904. Otherwise, if the packet is an RDP packet, the RDM uses the packet's SrcLNID to look up the entry corresponding to the SrcLNID in the Connection Control Block (CCB), in step 905. If there is no corresponding entry, the RDM discards the packet. Otherwise, the RDM goes to step 906 and checks to make sure that the packet's source fabric address (e.g., the 6-byte MAC SA in an Ethernet header or the Source Local ID in an InfiniBand LRH) matches the CCB entry's remote fabric address (e.g., for Path 0 or Path 1). If not, the RDM discards the packet. Otherwise, the RDM checks to determine whether the packet's DestLNID matches the CCB entry's MY_LNID, in step 907. If not, the RDM discards the packet. But if there is a match, the RDM forwards the packet to the CMM or DMM for further processing.
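The validation sequence of FIG. 9 reduces to a chain of early exits, sketched below in Python. The packet and CCB are modeled as plain dictionaries with hypothetical key names (`dst_address`, `is_rdp`, `remote_fabric_addresses`, and so on), and the return strings are illustrative labels rather than anything defined by RDP.

```python
from typing import Dict


def validate_packet(packet: dict, ccb: Dict[int, dict], my_fabric_address: bytes) -> str:
    """Return "forward", "non-rdp", or "discard" for a received packet."""
    # Step 902: destination fabric address must match this node.
    if packet["dst_address"] != my_fabric_address:
        return "discard"
    # Steps 903-904: non-RDP packets take a separate processing path.
    if not packet["is_rdp"]:
        return "non-rdp"
    # Step 905: look up the CCB entry by the packet's SrcLNID.
    entry = ccb.get(packet["src_lnid"])
    if entry is None:
        return "discard"
    # Step 906: source fabric address must match one of the entry's paths.
    if packet["src_address"] not in entry["remote_fabric_addresses"]:
        return "discard"
    # Step 907: the packet's DestLNID must equal the entry's MY_LNID.
    if packet["dst_lnid"] != entry["my_lnid"]:
        return "discard"
    return "forward"
```

Ordering the checks this way means a packet is forwarded to the CMM or DMM only when the fabric addresses and both LNIDs agree with the connection state recorded in the CCB.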

E. Using LNIDs With Memory-Addressing Scheme

As indicated earlier, the DSM system also uses LNIDs in its memory-addressing scheme. In particular embodiments, the physical memory address width is 40 bits (e.g., in DSM systems that use the present generation of Opteron CPUs), though it will be appreciated that there are numerous other suitable widths. FIG. 10 is a diagram showing the format of a 40-bit physical memory address in a 16-node DSM system and the format of a 40-bit physical memory address in a 256-node DSM system. As shown in FIG. 10, the four most significant bits comprise an LNID in the 16-node DSM system and the eight most significant bits comprise an LNID in the 256-node DSM system.
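The address layout of FIG. 10 can be illustrated with a short Python sketch that splits a 40-bit physical address into its LNID and the offset within that node's segment. The function name `split_address` and the `lnid_bits` parameter are hypothetical; `lnid_bits` is 4 for the 16-node format and 8 for the 256-node format.

```python
ADDRESS_BITS = 40  # physical address width described in the text


def split_address(addr: int, lnid_bits: int) -> tuple:
    """Split a 40-bit physical address into (lnid, offset).

    lnid_bits is 4 in the 16-node format and 8 in the 256-node format.
    """
    if not 0 <= addr < (1 << ADDRESS_BITS):
        raise ValueError("address does not fit in 40 bits")
    offset_bits = ADDRESS_BITS - lnid_bits
    lnid = addr >> offset_bits                  # most significant bits
    offset = addr & ((1 << offset_bits) - 1)    # remaining low-order bits
    return lnid, offset
```

The same 40-bit value is interpreted differently in the two formats: `split_address(0x2000000010, 4)` gives LNID 2, while `split_address(0x2000000010, 8)` gives LNID 32, with an offset of 0x10 in both cases.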

In particular embodiments of the DSM system, the physical address space for a virtual server is arranged so that the local node's memory always starts at address 0 (zero). One reason for using this arrangement is compatibility with legacy system software, in particular embodiments. Specifically, with local memory starting at address 0, system software (e.g., boot code) accesses local memory the same way that it does on a standard server. Another reason for using this arrangement is that it simplifies the address lookup in the CMM. For a memory read/write request from a local processor, an address in the lower 1/16th or 1/256th segment of the 40-bit address space is always local and all other addresses map to memory in other nodes.

To see how the arrangement works, consider the example of a virtual server consisting of three nodes: 0, 1, and 2. In a 16-node DSM system, the total addressable memory space for this virtual server would be 1 terabyte (2^40 bytes) and each node would be allocated a segment which is 1/16 of that space (64 GB or 2^36 bytes). From a global view, the first 64 GB segment of the physical address space starting at address 0 would be allocated to node 0 (i.e., the node whose LNID equals 0), the next 64 GB segment to node 1, and the following segment to node 2. The remaining 13 segments would be unused since LNIDs 3-15 are not used.

FIG. 11 shows this physical address space from the local view of each of the three nodes in the virtual server. The local view of node 0 would be the same as the global view and is shown in FIG. 11 under the label “Node 0”, with Local Memory (0) first, Node 1 Memory second, and Node 2 Memory third. The local view of node 1 would be as shown under the label “Node 1”, with Local Memory (1) first, Node 0 Memory second, and Node 2 Memory third. And the local view of node 2 would be as shown under the label “Node 2”, with Local Memory (2) first, Node 1 Memory second, and Node 0 Memory third.
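The three local views can be reproduced with a small Python sketch. It assumes the 16-node format, in which each segment is 2^36 bytes, and it derives each view by exchanging the local node's segment with segment 0; the names `SEGMENT_BYTES` and `local_view` are hypothetical.

```python
SEGMENT_BYTES = 1 << 36  # 64 GB per node in the 16-node format


def local_view(my_lnid: int, lnids: list) -> list:
    """Return the owner LNID of each segment, in local address order.

    The local node's segment and segment 0 trade places; every other
    segment stays where the global view puts it.
    """
    def local_segment(lnid: int) -> int:
        if lnid == my_lnid:
            return 0
        if lnid == 0:
            return my_lnid
        return lnid

    owner_by_segment = {local_segment(lnid): lnid for lnid in lnids}
    return [owner_by_segment[seg] for seg in sorted(owner_by_segment)]
```

For the three-node virtual server, `local_view(0, [0, 1, 2])` returns `[0, 1, 2]`, `local_view(1, [0, 1, 2])` returns `[1, 0, 2]`, and `local_view(2, [0, 1, 2])` returns `[2, 1, 0]`, matching the orderings described for FIG. 11. Sixteen segments of `SEGMENT_BYTES` each account for the full 2^40-byte space.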

It will be appreciated that in order to accomplish this arrangement, the locations of the local segment and the node 0 segment are swapped in the address map. And since MY_LNID, as defined above, is the LNID assigned to the local node, this is equivalent to swapping MY_LNID with LNID 0 in the address map. However, such a swapping would create confusion in the DSM system if it were applied to memory traffic leaving the node over the switched fabric. Therefore, the node's CMM reverses the swapping for traffic leaving the node.

FIG. 12 is a diagram showing a flowchart of an example process for altering a physical memory address, by the swapping described above, prior to transmission over a HyperTransport bus. In the process's first step 1201, a node's CMM receives a memory operation (e.g., a read, write, or probe) pertaining to a physical memory address from the RDM on the DSM-management chip. In step 1202, the CMM determines whether the four (or eight) most significant bits in the physical address are equal to: (1) the MY_LNID value for the node; or (2) zero. If so, the CMM goes to step 1203, where: (1) if those bits are equal to the MY_LNID value, the CMM sets the bits to zero (e.g., by changing to zero the four (or eight) most significant bits in the physical memory address) before transmission of the operation over the HyperTransport bus; and (2) if those bits are equal to zero, the CMM sets those bits to MY_LNID (e.g., by changing to MY_LNID the four (or eight) most significant bits in the physical memory address) before transmission of the operation over the HyperTransport bus. Otherwise, if those bits are not equal to MY_LNID or zero, the CMM goes to step 1204 and allows the memory operation to proceed without processing relating to LNID swapping.

FIG. 13 is a diagram showing a flowchart of an example process for altering a physical memory address, by reversing the swapping as described above, prior to transmission over a switched fabric. In the process's first step 1301, a node's CMM receives a memory operation (e.g., a read, write, or probe) pertaining to a physical memory address from one of the node's CPUs over the HyperTransport (e.g., ccHT) bus that connects the node's CPUs to the node's DSM-management chip. In step 1302, the CMM determines whether the four (or eight) most significant bits in the physical address are equal to: (1) the MY_LNID value for the node; or (2) zero. If so, the CMM goes to step 1303, where: (1) if those bits are equal to the MY_LNID value, the CMM sets the DstLNID value to zero (e.g., by changing to zero the four (or eight) most significant bits in the physical memory address) before transmission of the operation to the RDM; and (2) if those bits are equal to zero, the CMM sets the DstLNID value to MY_LNID (e.g., by changing to MY_LNID the four (or eight) most significant bits in the physical memory address) before transmission of the operation to the RDM. Otherwise, if those bits are not equal to MY_LNID or zero, the CMM goes to step 1304 and allows the memory operation to proceed without processing relating to LNID swapping, if the physical memory address is not for exported local memory. (If the physical memory address is for exported local memory, a probe operation to another physical memory address might result, feeding back into the process at step 1301.)
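The bit manipulation common to the processes of FIG. 12 and FIG. 13 can be sketched as a single Python function. This is an illustrative model of the address rewrite only; it does not model the exported-memory probe path, and the name `swap_lnid` and its parameters are hypothetical. Because the same exchange of MY_LNID and zero is applied in both directions, applying the function twice restores the original address.

```python
ADDRESS_BITS = 40  # physical address width described in the text


def swap_lnid(addr: int, my_lnid: int, lnid_bits: int = 4) -> int:
    """Exchange MY_LNID and zero in the most significant bits of addr.

    lnid_bits is 4 for the 16-node format and 8 for the 256-node format.
    Addresses whose top bits are neither MY_LNID nor zero pass through
    unchanged.
    """
    offset_bits = ADDRESS_BITS - lnid_bits
    top = addr >> offset_bits
    offset = addr & ((1 << offset_bits) - 1)
    if top == my_lnid:
        top = 0            # local segment appears at address 0
    elif top == 0:
        top = my_lnid      # node 0's segment takes the local node's slot
    return (top << offset_bits) | offset
```

For a node whose MY_LNID is 2 in the 16-node format, `swap_lnid((2 << 36) | 0x10, 2)` returns `0x10`, and `swap_lnid(0x10, 2)` returns `(2 << 36) | 0x10`, while an address in node 1's segment is returned unchanged.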

Particular embodiments of the above-described processes might be comprised of instructions that are stored on storage media. The instructions might be retrieved and executed by a processing system. The instructions are operational when executed by the processing system to direct the processing system to operate in accord with the present invention. Some examples of instructions are software, program code, firmware, and microcode. Some examples of storage media are memory devices, tape, disks, integrated circuits, and servers. The term “processing system” refers to a single processing device or a group of inter-operational processing devices. Some examples of processing devices are integrated circuits and logic circuitry. Those skilled in the art are familiar with instructions, storage media, and processing systems.

Those skilled in the art will appreciate variations of the above-described embodiments that fall within the scope of the invention. In this regard, it will be appreciated that there are many other possible orderings of the steps in the processes described above and many other possible modularizations of those orderings. Also, it will be appreciated that the above processes relating to memory-addressing will work with physical memory addresses that exceed 40 bits in width and DSM systems that have more than 256 nodes. Further, it will be appreciated that the DSM system will work with nodes whose CPUs are not Opterons having a ccHT bus. As a result, the invention is not limited to the specific examples and illustrations discussed above, but only by the following claims and their equivalents.

1. A method, comprising: receiving, at a distributed shared memory circuit of a first node in a distributed shared memory system, a message from a second node in the distributed shared memory system comprising a plurality of nodes each having a unique logical unit identifier, wherein the message indicates a memory operation related to a local memory of the first node and identifies a memory address; if a first plurality of contiguous bits of the memory address equal a logical node identifier of the first node, changing the first plurality of contiguous bits to a predetermined value; if the first plurality of contiguous bits of the memory address equal the predetermined value, changing the first plurality of contiguous bits to the logical node identifier of the first node; forwarding the message to a processor of the first node for processing.
 2. The method of claim 1 wherein the predetermined value is zero.
 3. The method of claim 1 wherein the first set of contiguous bits of the memory address are the most significant bits.
 4. The method of claim 1 wherein the plurality of nodes internally access their respective local memories having the first plurality of contiguous bits set to the predetermined value.
 5. The method of claim 1 wherein the plurality of nodes access the local memory of the node having a logical unit identifier equal to the predetermined value using its own respective logical node identifier.
 6. The method of claim 1 wherein the memory operation is a read command.
 7. The method of claim 1 wherein the memory operation is a write command.
 8. The method of claim 1 wherein the memory operation is a probe.
 9. A method comprising receiving, at a distributed shared memory circuit of a first node in a distributed shared memory system, a message from a processor of the first node identifying a memory operation related to a local memory of a second node in the distributed shared memory system comprising a plurality of nodes each having a unique logical unit identifier, wherein the message identifies a memory address; if a first plurality of contiguous bits of the memory address equal a logical node identifier of the first node, changing the first plurality of contiguous bits to a predetermined value; if the first plurality of contiguous bits of the memory address equal the predetermined value, changing the first plurality of contiguous bits to the logical node identifier of the first node; forwarding the message to the second node for processing.
 10. The method of claim 9 wherein the predetermined value is zero.
 11. The method of claim 9 wherein the first set of contiguous bits of the memory address are the most significant bits.
 12. The method of claim 9 wherein the plurality of nodes internally access their respective local memories having the first plurality of contiguous bits set to the predetermined value.
 13. The method of claim 9 wherein the plurality of nodes access the local memory of the node having a logical unit identifier equal to the predetermined value using its own respective logical node identifier.
 14. The method of claim 9 wherein the memory operation is a read command.
 15. The method of claim 9 wherein the memory operation is a write command.
 16. The method of claim 9 wherein the memory operation is a probe.
 17. A distributed shared memory system, comprising: a plurality of interconnected nodes, wherein each node has a logical node identifier comprising a plurality of contiguous bits; wherein each of the nodes comprises one or more processors and a local memory; and wherein each of the nodes further comprises a distributed memory logic circuit operative to share the local memory of a respective node in a distributed shared memory system to create a shared memory in connection with other nodes of the plurality of nodes accessible using binary addresses comprising a plurality of bits, wherein a first set of contiguous bits of the binary addresses of the shared memory correspond to a logical node identifier of a node in the plurality of nodes, and wherein the one or more processors of each of the nodes are operative to access the local memory of its own node having the first set of contiguous bits of the binary addresses set to a uniform predetermined value; and wherein the distributed memory logic circuit is further operative to map the uniform predetermined value to the logical node identifier of the local node in memory management traffic transmitted between the nodes that include binary addresses of the shared memory.
 18. The system of claim 17 wherein each of the one or more processors access the local memory of the node having a logical node identifier equal to the predetermined value using the logical node identifier of its own node.
 19. The method of claim 17 wherein the predetermined value is zero.
 20. The method of claim 17 wherein the first set of contiguous bits of the memory address are the most significant bits.
 21. The method of claim 17 wherein the distributed memory logic circuit is operative to if a first plurality of contiguous bits of the binary address equal a logical node identifier of the node, change the first plurality of contiguous bits to the predetermined value; if the first plurality of contiguous bits of the memory address equal the predetermined value, change the first plurality of contiguous bits to the logical node identifier of the node.